Data Import: CHEAT SHEET


R’s tidyverse is built around tidy data stored 
in tibbles, which are enhanced data frames. 


The front side of this sheet shows 
how to read text files into R with 
readr. 


The reverse side shows how to 
create tibbles with tibble and to 
lay out tidy data with tidyr. 





OTHER TYPES OF DATA 


Try one of the following packages to import 
other types of files: 

• haven - SPSS, Stata, and SAS files 
• readxl - Excel files (.xls and .xlsx) 
• DBI - databases 
• jsonlite - JSON 
• xml2 - XML 
• httr - Web APIs 
• rvest - HTML (Web Scraping) 


Save Data 


Save x, an R object, to path, a file path, as: 


Comma delimited file 
write_csv(x, path, na = "NA", append = FALSE, 
col_names = !append) 
File with arbitrary delimiter 
write_delim(x, path, delim = " ", na = "NA", 
append = FALSE, col_names = !append) 
CSV for Excel 
write_excel_csv(x, path, na = "NA", append = 
FALSE, col_names = !append) 
String to file 
write_file(x, path, append = FALSE) 
String vector to file, one element per line 
write_lines(x, path, na = "NA", append = FALSE) 
Object to RDS file 
write_rds(x, path, compress = c("none", "gz", 
"bz2", "xz"), ...) 
Tab delimited files 
write_tsv(x, path, na = "NA", append = FALSE, 
col_names = !append) 
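
As a rough illustration of the save functions above, the sketch below writes a small data frame with write_csv() and reads it back. The names df and tmp and the temporary file are assumptions made for the example, not part of the sheet. The arguments are passed by position because newer readr releases name the second argument file rather than path.

```r
library(readr)

# Hypothetical example data; df and tmp are illustrative names only.
df <- data.frame(a = c(1, 4), b = c(2, 5), c = c(3, NA))
tmp <- tempfile(fileext = ".csv")

# Positional arguments avoid depending on whether the second
# argument is called path (older readr) or file (newer readr).
write_csv(df, tmp, na = "NA")

# Look at the raw text that was written, one element per line.
csv_lines <- read_lines(tmp)

# Read the file back; the "NA" string becomes a missing value again.
df_back <- read_csv(tmp)
```

The same pattern should carry over to write_tsv() and write_delim(), which differ only in the delimiter they use.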




Read Tabular Data - These functions share the common arguments: 


read_*(file, col_names = TRUE, col_types = NULL, locale = default_locale(), na = c("", "NA"), 
quoted_na = TRUE, comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, 
n_max), progress = interactive()) 


Comma Delimited Files 
read_csv("file.csv") 
To make file.csv run: 
write_file(x = "a,b,c\n1,2,3\n4,5,NA", path = "file.csv") 

Semi-colon Delimited Files 
read_csv2("file2.csv") 
write_file(x = "a;b;c\n1;2;3\n4;5;NA", path = "file2.csv") 

Files with Any Delimiter 
read_delim("file.txt", delim = "|") 
write_file(x = "a|b|c\n1|2|3\n4|5|NA", path = "file.txt") 

Fixed Width Files 
read_fwf("file.fwf", col_positions = c(1, 3, 5)) 
write_file(x = "a b c\n1 2 3\n4 5 NA", path = "file.fwf") 

Tab Delimited Files 
read_tsv("file.tsv") Also read_table(). 
write_file(x = "a\tb\tc\n1\t2\t3\n4\t5\tNA", path = "file.tsv") 
USEFUL ARGUMENTS 

Example file 
write_file("a,b,c\n1,2,3\n4,5,NA","file.csv") 
f <- "file.csv" 

Skip lines 
read_csv(f, skip = 1) 

No header 
read_csv(f, col_names = FALSE) 

Read in a subset 
read_csv(f, n_max = 1) 

Provide header 
read_csv(f, col_names = c("x", "y", "z")) 

Missing Values 
read_csv(f, na = c("1", "")) 
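
The arguments above can be compared side by side on the same example file. The following is a minimal sketch that writes the file to a temporary location (an assumption, to avoid touching the working directory); the result names skipped, no_header, first_row and renamed are illustrative only.

```r
library(readr)

# Same contents as the sheet's example file, written to a temporary path.
f <- tempfile(fileext = ".csv")
write_file("a,b,c\n1,2,3\n4,5,NA", f)

# skip = 1 drops the first line, so "1,2,3" is treated as the header.
skipped <- read_csv(f, skip = 1)

# col_names = FALSE keeps every line as data and names columns X1, X2, X3.
no_header <- read_csv(f, col_names = FALSE)

# n_max = 1 reads the header plus a single data row.
first_row <- read_csv(f, n_max = 1)

# A character vector of col_names supplies the header; the file's own
# first line is then read as data.
renamed <- read_csv(f, col_names = c("x", "y", "z"))
```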


Read Non-Tabular Data 


Read a file into a single string 
read_file(file, locale = default_locale()) 
Read each line into its own string 
read_lines(file, skip = 0, n_max = -1L, na = character(), 
locale = default_locale(), progress = interactive()) 
Read a file into a raw vector 
read_file_raw(file) 
Read each line into a raw vector 
read_lines_raw(file, skip = 0, n_max = -1L, 
progress = interactive()) 
Read Apache style log files 
read_log(file, col_names = FALSE, col_types = NULL, skip = 0, n_max = -1, progress = interactive()) 
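
To show how the non-tabular readers differ in what they return, here is a small sketch on a two-line text file. The file contents and the names txt, whole, lines_vec and raw_bytes are made up for the example.

```r
library(readr)

# Hypothetical two-line text file in a temporary location.
txt <- tempfile(fileext = ".txt")
write_lines(c("first line", "second line"), txt)

# One string holding the whole file, newlines included.
whole <- read_file(txt)

# A character vector with one element per line.
lines_vec <- read_lines(txt)

# The same file as a raw vector of bytes.
raw_bytes <- read_file_raw(txt)
```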
Data types 


readr functions guess the types of each column and 
convert types when appropriate (but will NOT 
convert strings to factors automatically). 





A message shows the type of each column in the 
result. 


## Parsed with column specification: 
## cols( 
##   age = col_integer(), 
##   sex = col_character(), 
##   earn = col_double() 
## ) 

age is an integer 
sex is a character 
earn is a double (numeric) 





1. Use problems() to diagnose problems. 
x <- read_csv("file.csv"); problems(x) 


2. Use a col_ function to guide parsing. 

• col_guess() - the default 
• col_character() 
• col_double(), col_euro_double() 
• col_datetime(format = "") Also 
  col_date(format = ""), col_time(format = "") 
• col_factor(levels, ordered = FALSE) 
• col_integer() 
• col_logical() 
• col_number(), col_numeric() 
• col_skip() 

x <- read_csv("file.csv", col_types = cols( 
A = col_double(), 
B = col_logical(), 
C = col_factor())) 
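
The col_types call above can be tried on a small file. This is a sketch under some assumptions: the file contents are invented, the file is written to a temporary path, and the factor levels are spelled out explicitly because some readr versions require the levels argument of col_factor().

```r
library(readr)

# Hypothetical file whose columns match the A, B, C specification.
f <- tempfile(fileext = ".csv")
write_file("A,B,C\n1.5,TRUE,x\n2,FALSE,y", f)

# Declare the type of each column instead of letting readr guess.
x <- read_csv(f, col_types = cols(
  A = col_double(),
  B = col_logical(),
  C = col_factor(levels = c("x", "y"))  # levels given explicitly (assumption)
))

# List any values that could not be parsed with the declared types.
problems(x)
```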


3. Else, read in as character vectors then parse 
with a parse_ function. 

• parse_guess() 
• parse_character() 
• parse_datetime() Also parse_date() and 
  parse_time() 
• parse_double() 
• parse_factor() 
• parse_integer() 
• parse_logical() 
• parse_number() 

x$A <- parse_number(x$A) 
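
As a quick sketch of the parse_ approach, the example below applies a few parsers to character vectors. The input strings and the names messy, nums, ints and day are invented for illustration.

```r
library(readr)

# Hypothetical character data with currency symbols, units and text.
messy <- c("$1,200", "45%", "7 apples")

# parse_number() keeps the first number it finds in each string and
# drops surrounding characters and the default grouping mark (",").
nums <- parse_number(messy)

# Stricter parsers expect the whole string to match the target type.
ints <- parse_integer(c("1", "2", "3"))
day <- parse_date("2020-01-31")
```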
Tibbles - an enhanced data frame 

The tibble package provides a new S3 class for storing tabular data, the 
tibble. Tibbles inherit the data frame class, but improve three behaviors: 

• Subsetting - [ always returns a new tibble, [[ and $ always return a vector. 
• No partial matching - You must use full column names when subsetting. 
• Display - When you print a tibble, R provides a concise view of the 
  data that fits on one screen. 

[Figure: a large table to display, shown as a tibble display and as a data frame display] 

• Control the default appearance with options: 
  options(tibble.print_max = n, tibble.print_min = m, tibble.width = Inf) 
• View full data set with View() or glimpse() 
• Revert to data frame with as.data.frame() 

CONSTRUCT A TIBBLE IN TWO WAYS 

tibble(...) 
Construct by columns. 
tibble(x = 1:3, y = c("a", "b", "c")) 

tribble(...) 
Construct by rows. 
tribble( ~x, ~y, 
          1, "a", 
          2, "b", 
          3, "c") 

as_tibble(x, ...) Convert data frame to tibble. 
enframe(x, name = "name", value = "value") Convert named vector to a tibble 
is_tibble(x) Test whether x is a tibble. 


Tidy Data with tidyr 

Tidy data is a way to organize tabular data. It provides a consistent data structure across packages. 

A table is tidy if: 
• Each variable is in its own column 
• Each observation, or case, is in its own row 

Tidy data: 
• Makes variables easy to access as vectors 
• Preserves cases during vectorized operations 


Reshape Data - change the layout of values in a table 

Use pivot_longer() and pivot_wider() to reorganize the values of a table into a new layout. 

pivot_longer(data, cols, names_to = "name", names_prefix = NULL, names_sep = NULL, 
names_pattern = NULL, names_ptypes = list(), names_transform = list(), names_repair = 
"check_unique", values_to = "value", values_drop_na = FALSE, values_ptypes = list(), 
values_transform = list(), ...) 
pivot_longer() pivots cols columns, moving column names into a names_to column, and 
column values into a values_to column. 
pivot_longer(table4a, cols = 2:3, names_to = "year", values_to = "cases") 

pivot_wider(data, id_cols = NULL, names_from = name, names_prefix = "", names_sep = "_", 
names_glue = NULL, names_sort = FALSE, names_repair = "check_unique", values_from = value, 
values_fill = NULL, values_fn = NULL, ...) 
pivot_wider() pivots a names_from and a values_from column into a rectangular field of 
cells. 
pivot_wider(table2, names_from = type, values_from = count) 


Handle Missing Values 

drop_na(data, ...) 
Drop rows containing NA's in ... columns. 
drop_na(x, x2) 

fill(data, ..., .direction = c("down", "up")) 
Fill in NA's in ... columns with most recent non-NA values. 
fill(x, x2) 

replace_na(data, replace = list(), ...) 
Replace NA's by column. 
replace_na(x, list(x2 = 2)) 


Expand Tables - quickly create tables with combinations of values 

complete(data, ..., fill = list()) 
Adds to the data missing combinations of the values of the variables listed in ... 
complete(mtcars, cyl, gear, carb) 

expand(data, ...) 
Create new tibble with all possible combinations of the values of the variables listed in ... 
expand(mtcars, cyl, gear, carb) 


Split Cells 

Use these functions to split or combine cells into individual, isolated values. 

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, 
extra = "warn", fill = "warn", ...) 
Separate each cell in a column to make several columns. 
separate(table3, rate, sep = "/", into = c("cases", "pop")) 

separate_rows(data, ..., sep = "[^[:alnum:].]+", convert = FALSE) 
Separate each cell in a column to make several rows. 
separate_rows(table3, rate, sep = "/") 

unite(data, col, ..., sep = "_", remove = TRUE) 
Collapse cells across several columns to make a single column. 
unite(table5, century, year, col = "year", sep = "") 
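
The reshaping and cell-splitting calls above can be run as written on the example tables that ship with tidyr (table4a, table3, table5). The sketch below assumes a tidyr version that provides pivot_longer() and pivot_wider(); the result names long, wide, split_rate and united are illustrative only.

```r
library(tidyr)

# table4a stores one column per year; pivot it so that year is a variable.
long <- pivot_longer(table4a, cols = 2:3,
                     names_to = "year", values_to = "cases")

# Pivoting the long table back gives one column per year again.
wide <- pivot_wider(long, names_from = year, values_from = cases)

# table3 stores "cases/pop" in a single rate column; split it in two.
split_rate <- separate(table3, rate, sep = "/", into = c("cases", "pop"))

# table5 stores century and year separately; paste them into one column.
united <- unite(table5, century, year, col = "year", sep = "")
```

Note that separate() leaves the new columns as character vectors unless convert = TRUE is supplied.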